Abstract

Feature representations extracted from deep neural network-based
multilingual frontends provide significant improvements to speech
recognition systems in low resource settings. To effectively train
these frontends, we introduce a data selection technique that
discovers language groups from an available set of training
languages. This data selection method reduces the required amount
of training data and training time by approximately 40%, with
minimal performance degradation. We present speech recognition
results on 7 very limited language pack (VLLP) languages from the
second option period of the IARPA Babel program using multilingual
features trained on up to 10 languages. The proposed multilingual
features provide up to 15% relative improvement over baseline
acoustic features on the VLLP languages.

Index  Terms:  Multilingual  features,  acoustic  models,  deep 
neural  networks,  low  resource  speech  recognition. 

1.  Introduction 

Although acoustic models for state-of-the-art speech recognition
systems are typically trained on several hundred hours of task
specific training data, in low resource scenarios only a few hours
of annotated training data are often available. In these settings,
it is possible to take advantage of transcribed data from other
languages to build multilingual acoustic models [1, 2]. With deep
neural networks (DNNs) becoming popular for acoustic modeling,
several variants of these networks have been proposed for speech
recognition in low resource settings [3-15]. They typically fall
into the following three broad classes:

(a)  Networks  that  use  a  common  phoneme  set  covering  all  the 
languages  in  the  training  set  to  train  a  multilingual  acoustic 
model  [3,4]. 

(b) Networks trained with multiple language specific output
layers to alleviate the burden of finding a common multilingual
phoneme set. These networks are first trained with separate
output layers for each language in the training set, and then
fine-tuned to the final target language [6-8].

(c) Networks trained as described above, but used to extract
multilingual bottleneck features for subsequent processing
instead of directly being used as acoustic models [5, 9, 10].

For training all these classes of networks, it is useful to
determine the right amount of multilingual training data and the
languages that contribute most to training effective acoustic
models [13, 14]. In this paper, we investigate an approach to guide
data selection for training multilingual feature frontends, in the
spirit of the class (c) models described above. The proposed
data-driven technique, which is based on an analysis of phoneme
confusion matrices, allows for similarities between languages
available for training to be visualized. By requiring limited
amounts (≈ 3 hours) of transcribed data for analysis, the method
circumvents the need to transcribe large amounts of data for
training. Only candidate languages from the selected language
clusters now need to be transcribed for training the multilingual
feature frontends. Our experiments show that frontends trained on
only the selected languages can perform as well as frontends
trained on the entire available data. This leads to close to 50%
reduction in the amount of transcribed data and the time required
for training the frontend.

The remainder of the paper is organized as follows. In section 2,
we describe the multilingual feature frontend [15] used to produce
multilingual representations for IBM's speech recognition and
keyword search systems used in the Babel [16] Option Period 2 (OP2)
evaluation. Although this multilingual frontend can be trained in
advance of the evaluation period, training the model on close to
1000 hours of speech from 10 languages is time consuming. To
increase the efficacy of this model, we investigate the use of
multilingual data sampling. Section 3 describes the proposed data
selection technique and its application to an available pool of 10
languages [17-29,29-33]. Section 4 describes experiments and
results using the multilingual frontend and the identified language
clusters. The paper concludes with a discussion in section 5.

2.  The  Multilingual  Feature  Frontend 

The feature frontend used in this paper employs two DNNs in a
hierarchical fashion [15]. As in the architectures proposed in [5],
the first neural network in the hierarchy is trained on acoustic
features extracted from the data, while the second network models
intermediate multilingual representations extracted from the
bottleneck layer of the first network. Both networks are trained on
data from several languages by using language-specific output
layers, instead of mapping the data to a common phoneme set. The
final output of this feature frontend is a multilingual
representation from the bottleneck layer of the second network.
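
The shared-layer design can be illustrated with a minimal NumPy sketch
of a single network in the hierarchy. Only the idea of shared hidden
layers, a bottleneck layer, and one softmax output layer per language
comes from the text; the hidden layer size, the number of hidden
layers, the sigmoid activation, the language names and the helper
names (`init_frontend`, `forward`) are assumptions made for
illustration.

```python
import numpy as np

def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases for one fully connected layer."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

def init_frontend(rng, input_dim, hidden_dim, num_hidden, bottleneck_dim, head_sizes):
    """Shared hidden layers, a bottleneck, and one output layer per language."""
    dims = [input_dim] + [hidden_dim] * num_hidden
    shared = [init_layer(rng, a, b) for a, b in zip(dims[:-1], dims[1:])]
    return {
        "shared": shared,
        "bottleneck": init_layer(rng, hidden_dim, bottleneck_dim),
        "heads": {lang: init_layer(rng, bottleneck_dim, n)
                  for lang, n in head_sizes.items()},
    }

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def forward(params, x, language):
    """Return (bottleneck features, posteriors from the chosen language head)."""
    h = x
    for w, b in params["shared"]:
        h = 1.0 / (1.0 + np.exp(-(h @ w + b)))  # sigmoid activation (assumed)
    w, b = params["bottleneck"]
    bottleneck = h @ w + b                       # language-independent features
    w, b = params["heads"][language]
    return bottleneck, softmax(bottleneck @ w + b)
```

In this sketch the bottleneck output does not depend on which language
head is selected, which is what allows it to be used as a multilingual
feature for a language that was never part of training.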

In our training framework, we use 40-dimensional log-Mel filterbank
features spliced together with a context of ±5 frames as input to
the first DNN. The 80-dimensional bottleneck features extracted
from the first network are then used as features for the second
DNN. The context of these multilingual features is expanded to ±10
frames but is then subsampled at a two frame rate to produce a
400-dimension feature vector. Both DNNs use up to 10 independent
output softmax layers corresponding to the 10 training languages
used in the Base and OP1 evaluation periods of the Babel program
[9]. These languages are Assamese, Bengali, Pashto, Turkish,
Tagalog, Vietnamese, Haitian Creole, Lao, Tamil and Zulu. By
sharing fully connected hidden layers across all languages, this
architecture learns a multilingual representation without requiring
a common phoneme set that covers all the training languages. The
DNNs are trained on alignments produced by HMM-GMM acoustic models
trained on each language separately, using standard error
back-propagation to minimize the cross-entropy objective function.

Figure 1: Identification of language clusters using scores from an
LID system
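
As an illustration of the input splicing for the first network, the
sketch below stacks each 40-dimensional frame with its ±5 neighbours
to give a 440-dimensional vector. Repeating the first and last frame
at the utterance boundaries is an assumption, as is the function name
`splice_frames`; the text does not say how boundaries are handled.

```python
import numpy as np

def splice_frames(features, context):
    """Stack each frame with `context` frames on either side.

    features: array of shape (num_frames, dim)
    returns:  array of shape (num_frames, (2 * context + 1) * dim)
    """
    num_frames = features.shape[0]
    # Repeat the edge frames so every frame has a full context window (assumed).
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    windows = [padded[offset:offset + num_frames]
               for offset in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)

# 100 frames of 40-dimensional log-Mel features, context of +/-5 frames.
log_mel = np.zeros((100, 40))
spliced = splice_frames(log_mel, context=5)   # shape (100, 440)
```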

In the context of the Babel program, training a multilingual
feature frontend using 1000 hours of data across the 10 available
training languages is time consuming. It would hence be
advantageous to train a similar-performing network on significantly
fewer hours of speech. Multilingual data selection can also be
beneficial in a different setting. For a new low resource language,
if one had access to large amounts of untranscribed data from
several other languages, it would be cost effective to know that
transcribing a certain set of languages is more important than
attempting to transcribe all the languages to build a multilingual
frontend. In the next section we show how language clusters can be
identified with up to 3 hours of transcribed data from each of the
available languages. The languages falling under the dominant
cluster can then be selected as candidates for transcription rather
than blindly transcribing all the data.

3.  Detecting  Language  Clusters 

To detect similarities between languages and subsequently language
clusters, we investigate the use of a data-driven technique based
on an analysis of confusion matrices. These confusion matrices are
estimated via two methods described below.

3.1. Language clusters using scores from a language identification
network

In [14], to find similarities between languages, a language
identification approach is proposed. This technique works by first
training a shallow neural network (NN) to predict language
posterior probabilities and then averaging the posterior scores
over frames. For a set of languages that are used to train the
language identification (LID) network, pairs of languages that are
close to each other are shown to have higher predicted posteriors.
We explore this technique further by training a similar LID network
that discriminates between the context independent phonemes of all
the languages we wish to identify. We then combine the scores
corresponding to phonemes of each individual language to obtain
global language scores. The combined context independent phoneme
set is created by collecting context independent phonemes across
all languages and keeping them language specific, i.e. without
merging any phonemes although they might be acoustically similar.

As shown in Fig. 1, after an NN based LID system is trained,
feature frames of each language are passed through the network to
first derive posteriors of context independent phonemes of all
languages, before phonemes of each individual language are
collapsed to a single language score. After averaging the language
scores over the number of input feature frames in each language,
the scores are then used to populate a language similarity matrix.
The language similarity matrix is further used to construct a graph
where individual nodes correspond to languages and connecting arcs
are weighted by scores from the language similarity matrix.
Spectral clustering is then applied to this graph to form language
clusters, by solving a convex relaxation of the normalized graph
cut problem [34].

To train the LID network, only a very small subset of the training
data of each language is used. In our experiments we use about 3
hours of transcribed data for each language (2% of the available
data). The remaining data is only used or transcribed if it belongs
to a dominant language cluster, hence saving on the cost of
transcribing large data sets. For the 10 Babel training languages
in hand, we train an LID system on about 30 hours of speech using
about 3 hours of speech from each language. A network with 3 hidden
layers is trained to discriminate between 435 context independent
phonemes, which are combined during test time to produce 10
dimensional language posteriors. These language posteriors are then
averaged over all input frames and are used to populate a language
similarity matrix. After constructing a language graph and
automatically partitioning it, we pick languages in the dominant
cluster to train a hierarchical multilingual feature frontend as
shown in Fig. 2.

Figure 2: Schematic of the hierarchical multilingual feature
frontend using selected languages

Figure 3: Identification of language clusters using scores from
individually trained neural networks
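
The collapsing and averaging steps of the LID based approach can be
sketched as follows. This is a minimal illustration that assumes the
phoneme posteriors have already been produced by some LID network;
the function names and the `phone_to_lang` index array (mapping each
language-specific phoneme to its language) are hypothetical.

```python
import numpy as np

def collapse_to_language_posteriors(phone_posteriors, phone_to_lang, num_langs):
    """Sum phoneme posteriors over the phonemes belonging to each language.

    phone_posteriors: (num_frames, num_phones), each row sums to 1
    phone_to_lang:    (num_phones,) language index of every phoneme
    returns:          (num_frames, num_langs) language posteriors
    """
    num_phones = len(phone_to_lang)
    mapping = np.zeros((num_phones, num_langs))
    mapping[np.arange(num_phones), phone_to_lang] = 1.0
    return phone_posteriors @ mapping

def language_similarity_matrix(posteriors_by_lang, phone_to_lang):
    """Row k holds the frame-averaged language scores for data of language k."""
    num_langs = len(posteriors_by_lang)
    similarity = np.zeros((num_langs, num_langs))
    for source, phone_posteriors in enumerate(posteriors_by_lang):
        lang_posteriors = collapse_to_language_posteriors(
            phone_posteriors, phone_to_lang, num_langs)
        similarity[source] = lang_posteriors.mean(axis=0)
    return similarity
```

Because each frame's phoneme posteriors sum to one, every row of the
resulting matrix also sums to one, so adding more languages
necessarily spreads the same unit of probability mass across more
columns.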

Although this approach can jointly discriminate between languages
by pooling them together, we observe that as the number of
languages being compared increases, the language scores tend to
spread out among classes. Empirically we notice that for fewer than
3 languages this approach can predict strong relationships. For
settings with 10 languages, however, the final scores across
languages are often uniformly spread out, limiting the discovery of
strong relationships between languages. A second limitation of this
approach is that the LID system needs to be retrained every time a
new language is added to the multilingual pool for selection, since
the LID NN is jointly trained across all languages. To alleviate
these limitations, we investigate how individual language networks
can be trained and combined to produce similarity scores useful for
identifying language clusters.

3.2. Language clusters using scores from individual language
networks

To discover language clusters from individual language networks, we
begin by estimating confusion matrices between language pairs. To
create a confusion matrix between two languages $\mathcal{L}_A$ and
$\mathcal{L}_B$, with phoneme sets $A$ and $B$, we train neural
networks on both languages. A network trained on $\mathcal{L}_A$,
for example, estimates posterior probabilities of speech sounds in
$A$, conditioned on the input feature vectors. We then forward pass
the data in $\mathcal{L}_B$ through the trained NN. To understand
the relationship between phonemes, we treat the phoneme recognition
system as a discrete, memory-less, noisy communication channel with
the phonemes in $B$ as source symbols to the system. Using the
recognized phonemes belonging to $A$ at the output of the
recognizer as received symbols, confusion matrices that
characterize the data sets are then computed.

Each time a feature vector corresponding to phoneme $b_i \in B$ is
passed through the trained NN, posterior probabilities
corresponding to all phonemes in set $A$ are obtained at the output
of the NN. We treat each of these posterior probabilities as
soft-counts to populate a phoneme confusion matrix (CM). From a
fully-populated confusion matrix CM, the following counts can be
derived. Entry $(i, j)$ of the confusion matrix corresponds to the
soft count aggregate $CM(i, j)$ of the total number of times
phoneme $b_i$ was recognized as $a_j$. Marginal count $CM(i)$ of
each row is the total number of times phoneme $b_i$ occurred in the
task-specific data. Similarly, count $CM(j)$ of each column is the
total number of times phoneme $a_j$ of the task-independent data
set was recognized. $C$ is the total number of counts in the
confusion matrix.
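
The soft-count accumulation can be sketched as below, assuming the
posteriors of the network trained on the first language are already
available for every frame of the second language, together with the
frame-level phoneme labels of that second language. The function and
argument names are hypothetical.

```python
import numpy as np

def soft_count_confusion_matrix(posteriors_a, labels_b, num_phones_b):
    """Accumulate posteriors over phoneme set A as soft counts.

    posteriors_a: (num_frames, |A|) outputs of the network trained on
                  the first language, one row per frame of the second
    labels_b:     (num_frames,) index i of the source phoneme b_i
    returns:      (|B|, |A|) matrix with CM[i, j] = soft count of b_i
                  being recognized as a_j
    """
    cm = np.zeros((num_phones_b, posteriors_a.shape[1]))
    # Unbuffered add so that frames sharing a label all contribute.
    np.add.at(cm, labels_b, posteriors_a)
    return cm

def marginals(cm):
    """Row counts CM(i), column counts CM(j) and the total count C."""
    return cm.sum(axis=1), cm.sum(axis=0), cm.sum()
```

Since every posterior vector sums to one, each row marginal equals
the number of frames labelled with that phoneme and the total count
equals the number of frames passed through the network.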

Given such a CM, a useful information theoretic quantity that can
be used to quantify relationships between the phonemes of each
language is the empirical pointwise mutual information [35]. In
[36], the use of this quantity in conjunction with confusion
matrices has been shown. For an input alphabet $A$ and output
alphabet $B$, using the count based confusion matrix, the empirical
pointwise mutual information between two symbols $a_i$ from $A$ and
$b_j$ from $B$ is expressed as

$$i_{AB}(a_i, b_j) = \log \frac{N_{ij}\, N}{N_i\, N_j} \qquad (1)$$

where $N_{ij}$ is the number of times the joint event
$(A = a_i, B = b_j)$ occurs and $N_i$, $N_j$ are the marginal
counts $\sum_j N_{ij}$ and $\sum_i N_{ij}$. $N$ is the total number
of events.

Using our soft count based confusion matrix between two phoneme
sets $A$ and $B$, we similarly define the empirical pointwise
mutual information between phoneme pairs $(a_i, b_j)$ as

$$i_{AB}(a_i, b_j) = \log \frac{CM(i, j)\, C}{CM(i)\, CM(j)} \qquad (2)$$

using quantities defined earlier.

Once  an  NN  is  trained  on  each  language,  a  per-phoneme 
mutual  information  (MI)  matrix  for  every  language  pair  can 
then  be  computed.  Entry  (i,j)  for  one  such  matrix  contains  the 
MI  score  between  phoneme  i  of  the  first  language  and  phoneme 
j  of  the  second  language.  We  then  compute  the  Frobenius  norm 
of  the  matrix  and  normalize  it  with  the  total  number  of  entries 
to  arrive  at  a  global  MI  score  between  the  two  languages. 
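
A minimal sketch of this scoring step is given below: it converts a
soft-count confusion matrix into a per-phoneme pointwise mutual
information matrix and then reduces it to a single language-pair
score. The small floor `eps`, used to avoid taking the logarithm of
zero, is an assumption and not part of the described method, as are
the function names.

```python
import numpy as np

def pmi_matrix(cm, eps=1e-12):
    """Empirical pointwise mutual information for every matrix entry.

    Computes log(CM(i, j) * C / (CM(i) * CM(j))) from the row and
    column marginals and the total count C of the confusion matrix.
    """
    total = cm.sum()
    row = cm.sum(axis=1, keepdims=True)
    col = cm.sum(axis=0, keepdims=True)
    # eps keeps the ratio finite when a count is exactly zero (assumed).
    return np.log((cm * total + eps) / (row * col + eps))

def global_mi_score(pmi):
    """Frobenius norm of the MI matrix divided by its number of entries."""
    return np.linalg.norm(pmi, ord="fro") / pmi.size

# Toy 2 x 2 confusion matrix with most of the mass on the diagonal.
toy_cm = np.array([[8.0, 2.0],
                   [2.0, 8.0]])
toy_score = global_mi_score(pmi_matrix(toy_cm))
```

When the rows and columns of the confusion matrix are statistically
independent, every entry of the MI matrix is zero and so is the
global score; structured confusions move the score away from zero.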

For each of the Babel training languages at hand, we train a
shallow 2-layer NN with 3 hours of transcribed data using context
independent phonemes. After these NNs have been trained, a 10 × 10
MI language similarity matrix over the 10 languages is computed.
The (i, j)-th entry of this matrix gives the information theoretic
similarity between languages i and j, with a higher score
signifying greater similarity. As shown in Fig. 3, the language
similarity matrix is further used to construct a language graph
from which clusters are identified as described earlier. Languages
in the dominant cluster are then used to train a hierarchical
multilingual feature frontend. Although languages are not jointly
discriminated in this approach of constructing language
similarities using individual language nets, pairwise language
similarities are enhanced by converting the posterior based scores
into mutual information based scores. This approach also has the
advantage of scaling more easily, as no model is jointly trained on
all languages. The language similarity matrix, however, needs to be
re-estimated (a new row/column entry needs to be estimated for each
new language).
After the application of the clustering algorithm on the LID based
graph, we discovered two dominant clusters: {Pashto, Tagalog,
Turkish, Bengali, Assamese, Zulu} and {Lao, Haitian Creole, Tamil,
Vietnamese}. The graph based on scores from individual languages,
on the other hand, is clustered into {Pashto, Tagalog, Haitian
Creole, Lao, Tamil, Zulu} and {Turkish, Bengali, Assamese,
Vietnamese}. We hypothesize that the 6-language clusters discovered
by the proposed techniques will be a useful representative set for
extracting multilingual features. Since the technique has nearly
halved the amount of training data, the multilingual frontend
training time is also reduced by close to 50%. If none of the 10
languages were fully transcribed, then with just 2% of the data
transcribed (3 hours × 10), this technique suggests that only data
from 6 languages needs to be transcribed to create an effective
multilingual frontend. This results in a 40% reduction in the data
transcription and processing effort. In the next section, we
evaluate the effectiveness of the proposed technique.

4.  Experiments  and  Results 

To evaluate the performance of multilingual frontends trained on
various language groups, we use features extracted from these
frontends on 7 VLLP languages (Swahili, Kurmanji, Tok Pisin,
Cebuano, Kazakh, Telugu and Lithuanian), each with just 3 hours of
transcribed data. Word Error Rates (WER %) are reported on a 3 hour
tuning set. We start by training baseline speaker-independent (SI)
acoustic models on 13-dimension PLP features with speaker-based
mean and variance normalization. A context of 9 frames is spliced
together and projected to a 40-dimensional feature space using
linear discriminant analysis (LDA), and the class-conditional
distributions are further diagonalized using a global, semi-tied
covariance (STC) transform.

In the following SI multilingual step, the PLP+LDA+STC features
described above are fused with ML features, transformed by LDA and
STC, and then used as input for a two-fold DNN pipeline. In each
fold, a new alignment is generated with the current model and a new
decision tree is built on top of the alignment. The DNN training
procedure comprises: (1) discriminative layer-wise pre-training and
(2) training with cross-entropy criterion. The DNN comprises 3
hidden layers of 1024 ReLU units, followed by one 1024-unit sigmoid
layer and a 1000-unit softmax layer. The baseline language models
(LM) are Kneser-Ney (KN)-smoothed bigram models with a 5K
vocabulary size. All the acoustic models are hybrid models trained
using the IBM Attila speech recognition toolkit [37].

Table 1 shows the Word Error Rates (WER %) with the baseline
acoustic features (PLP) and multilingual features extracted from a
feature frontend trained on 10 languages (ML-10) [15]. For all the
languages, multilingual features provide significant gains (up to
15% relative improvements) over the basic features.

Table 1: WER (%) using features from various multilingual
frontends.

Language      PLP    ML-10   SMP    RND    LID    IL
Swahili       75.2   66.0    67.5   68.0   67.6   66.8
Kurmanji      84.1   78.2    79.5   79.5   79.1   79.2
Tok Pisin     64.8   53.8    56.2   57.1   56.7   54.8
Cebuano       78.1   70.5    72.1   72.0   71.9   71.3
Kazakh        79.1   72.9    74.0   74.5   73.7   73.5
Telugu        87.6   82.3    83.7   84.4   83.6   82.9
Lithuanian    73.0   65.9    67.4   67.4   67.2   67.2

In the next set of experiments, we train a set of multilingual
frontends on the clusters identified using the two techniques
described earlier. These frontends have training data sampled from
the full training set of 10 languages in two different ways. They
include:

(a) A frontend on language clusters identified using the LID
based NN (LID)

(b) A frontend on language clusters identified using mutual
information scores from individually trained NNs (IL)

Table 1 shows the performance of these feature frontends in
comparison with conventional PLP features and multilingual features
from various frontends: (i) trained on all the languages (ML-10),
(ii) trained using up to 50% of data, uniformly sampled across all
the 10 languages (SMP) [15] and (iii) trained on 5 randomly
selected languages, namely Zulu, Turkish, Haitian Creole, Tagalog
and Assamese (RND). The following interesting observations can be
drawn from these results:

(a)  With  only  50%  of  the  data,  the  frontends  trained  on  the 
discovered  clusters  perform  almost  as  well  as  the  frontend 
trained  on  all  of  the  data. 

(b) In most cases the frontend trained using scores from
individually trained NNs performs better than the frontend
trained using scores from the LID based NN. This is consistent
with the earlier hypothesis that the LID based system cannot
perform well as the number of languages increases.

(c) Frontends trained on the identified language clusters almost
always perform better than frontends trained on random
selection of data.

(d) Since all these models use only 60% of the data, this result
highlights the need for selecting the right set of languages
for training. The training time for the proposed frontends is
around 10 days compared to 21 days for the ML-10 frontends
[15]. Hence there is also a significant reduction in training
time with the proposed technique.
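
The comparisons behind observations (b) and (c) can be checked
directly against the WER values reported in Table 1. The short script
below only re-reads the table; the helper names `count_wins` and
`mean_wer` are introduced here for illustration.

```python
# WER (%) values copied from Table 1.
WER = {
    "Swahili":    {"PLP": 75.2, "ML-10": 66.0, "SMP": 67.5, "RND": 68.0, "LID": 67.6, "IL": 66.8},
    "Kurmanji":   {"PLP": 84.1, "ML-10": 78.2, "SMP": 79.5, "RND": 79.5, "LID": 79.1, "IL": 79.2},
    "Tok Pisin":  {"PLP": 64.8, "ML-10": 53.8, "SMP": 56.2, "RND": 57.1, "LID": 56.7, "IL": 54.8},
    "Cebuano":    {"PLP": 78.1, "ML-10": 70.5, "SMP": 72.1, "RND": 72.0, "LID": 71.9, "IL": 71.3},
    "Kazakh":     {"PLP": 79.1, "ML-10": 72.9, "SMP": 74.0, "RND": 74.5, "LID": 73.7, "IL": 73.5},
    "Telugu":     {"PLP": 87.6, "ML-10": 82.3, "SMP": 83.7, "RND": 84.4, "LID": 83.6, "IL": 82.9},
    "Lithuanian": {"PLP": 73.0, "ML-10": 65.9, "SMP": 67.4, "RND": 67.4, "LID": 67.2, "IL": 67.2},
}

def count_wins(system, reference):
    """Number of languages where `system` has strictly lower WER."""
    return sum(1 for row in WER.values() if row[system] < row[reference])

def mean_wer(system):
    """Unweighted average WER of a system over all languages in the table."""
    return sum(row[system] for row in WER.values()) / len(WER)

if __name__ == "__main__":
    print("IL better than LID:", count_wins("IL", "LID"), "of", len(WER))
    print("LID better than RND:", count_wins("LID", "RND"), "of", len(WER))
    print("IL better than RND:", count_wins("IL", "RND"), "of", len(WER))
    for system in ("ML-10", "SMP", "RND", "LID", "IL"):
        print(system, round(mean_wer(system), 2))
```

On these numbers the IL frontend has a strictly lower WER than the
LID frontend on 5 of the 7 languages, and both cluster-based
frontends are lower than the random selection on all 7.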

5.  Conclusions 

In this paper we have introduced a simple technique to perform data
selection across languages for building multilingual frontends
using confusion matrices. With the proposed technique we identify
language clusters and show that models trained on selected
candidate languages can produce comparable performance with
significantly less training time and data (close to 50% reduction
in both training time and data). In this work we have assumed that
the frontend is built independent of the final target language. As
future work, it will be useful to investigate how languages can be
selected based on prior knowledge of the final target language.